This course is an open data science course using Git and R. My GitHub repository is at https://github.com/KatariinaParnanen/IODS-project
This week we learned about data wrangling in R with the dplyr package. We were also introduced to basic regression analysis and model validation using diagnostic plots.
We will first read in data from a local directory. The data is based on a questionnaire given to students on a course, together with their exam points.
#Read in table
lrn2014<-read.table("/Users/kparnane/Documents/IODS_course/IODS-project/data/learning2014.txt", header=TRUE, sep="\t")
We will explore the dataset’s dimensions and structure.
#Explore the dimensions
dim(lrn2014)
## [1] 166 7
#Explore the structure of the file
str(lrn2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
The data has 166 observations and 7 variable columns.
The dataset has a response variable “points” and explanatory variables “age”, “attitude”, “deep”, “surf” and “stra”. “deep” is the score for deep learning questions, “surf” for surface learning questions, and “stra” for strategic learning questions. “points” is the exam score. More info on the dataset can be found here.
We will explore the correlations and distributions of the explanatory and response variables using “ggpairs”, which plots pairwise correlations and variable distributions.
#Load required packages
library(GGally)
library(ggplot2)
#Draw plot
ggpairs(lrn2014, mapping = aes(col = gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
Pairwise correlation plot: Exploration of variables and correlations
#Get summaries of variables
summary(lrn2014)
## gender age attitude points deep
## F:110 Min. :17.00 Min. :14.00 Min. : 7.00 Min. :1.583
## M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:19.00 1st Qu.:3.333
## Median :22.00 Median :32.00 Median :23.00 Median :3.667
## Mean :25.51 Mean :31.43 Mean :22.72 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:27.75 3rd Qu.:4.083
## Max. :55.00 Max. :50.00 Max. :33.00 Max. :4.917
## surf stra
## Min. :1.583 Min. :1.250
## 1st Qu.:2.417 1st Qu.:2.625
## Median :2.833 Median :3.188
## Mean :2.787 Mean :3.121
## 3rd Qu.:3.167 3rd Qu.:3.625
## Max. :4.333 Max. :5.000
There are more females than males. Variables other than age seem to roughly follow a normal distribution. The mean age is 25. We are interested in the response variable points and how the explanatory variables correlate with it. The highest correlations are with “attitude” and “stra”, and there is a negative correlation with “surf”.
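The correlations with points can also be checked numerically instead of reading them off the pairs plot; a quick sketch, assuming the lrn2014 data frame loaded above:

```r
# Correlation of each explanatory variable with exam points
sapply(lrn2014[, c("age", "attitude", "deep", "surf", "stra")],
       function(x) cor(x, lrn2014$points))
```

This returns a named vector of correlations, making it easy to rank the candidate predictors before model fitting.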
Now we use the three variables with possible correlations with exam points, identified in the data exploration, in a linear model.
# fit a linear model
my_model <- lm(points ~ attitude+stra+surf, data = lrn2014)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The variable “attitude” is positively associated with points, with a p-value of 1.93e-08 and an estimate of 0.33952. The other variables are not significantly associated with exam points, so they can be excluded from the model.
Drop the non-significant variables and fit the model again.
# fit a linear model
my_model2 <- lm(points ~ attitude, data = lrn2014)
#Plot the regression line in scatter plot with exam points versus attitude
qplot(attitude, points, data = lrn2014) + geom_smooth(method = "lm")
Scatter plot with regression line: Relationship of points and attitude
Print summary of the model.
# print out a summary of the model
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude, data = lrn2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
There is a significant relationship between attitude and exam points. The attitude variable has a p-value of 1.95e-09, an estimate of 0.35255 and a standard error of 0.05674. The multiple R-squared is 0.1906 (adjusted 0.1856), meaning that the attitude variable explains roughly 19% of the variation in exam points.
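The figures quoted above can be pulled straight from the fitted model object rather than read off the printed summary; a small sketch assuming my_model2 from the previous chunk:

```r
# Extract the key statistics programmatically
s <- summary(my_model2)
s$coefficients["attitude", ]  # estimate, std. error, t value, p-value
s$r.squared                   # multiple R-squared
s$adj.r.squared               # adjusted R-squared
```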
We will produce diagnostic plots to see if the model fits the assumptions of linear models: the size of the errors does not depend on the explanatory variable values, the errors are normally distributed, and the explanatory variables are not correlated with each other. The observations should also be independent of each other.
# Draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(my_model2, which=c(1, 2, 5))
Diagnostic plots: Plots for exploring model errors
There don’t seem to be any patterns in the residuals vs fitted values plot, so the size of the residuals does not depend on the fitted values. The residuals in the Q-Q plot follow the theoretical line quite well, so the assumption of normally distributed errors is fulfilled. The residuals vs leverage plot does not reveal any observations with an unusually high effect on the model. The model also has only one explanatory variable, so correlation between explanatory variables is not a problem, and the observations are independent of each other. Thus, we can conclude that the model fits the assumptions.
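The visual normality check can also be complemented with a formal test, for example a Shapiro–Wilk test on the residuals (a sketch, again assuming my_model2 from above):

```r
# Formal test of residual normality; a large p-value is consistent
# with the Q-Q plot's conclusion of approximately normal errors
shapiro.test(resid(my_model2))
```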
This week we have learned about logistic regression, piping and model validation.
Read in data and explore.
# Get access to useful libraries
library(dplyr);library(ggplot2);library(tidyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Read in table from local folder
alc<-read.table(file = "data/alc.txt", sep="\t", header=TRUE)
#Explore the data using glimpse
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,...
## $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, ...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, G...
## $ Pstatus <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, ...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <fct> at_home, at_home, at_home, health, other, services,...
## $ Fjob <fct> teacher, other, other, services, other, other, othe...
## $ reason <fct> course, course, other, home, home, reputation, home...
## $ nursery <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, ye...
## $ internet <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes,...
## $ guardian <fct> mother, father, mother, mother, father, mother, mot...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, ...
## $ famsup <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes,...
## $ paid <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, ...
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes,...
## $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ romantic <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
# Draw a bar plot of each variable
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
The data has 382 observations and 35 variables from students whose alcohol consumption has been studied. The variables are integers, numerics, factors or logicals. More information about the data can be found here: https://archive.ics.uci.edu/ml/datasets/Student+Performance.
The data includes variables for sex, absences (absences from school), goout (going out) and age. There are usually differences in how much boys and girls drink due to social norms; absences from school probably correlate with high alcohol use; going out a lot usually means higher alcohol consumption due to more exposure to situations with alcohol; and age might have an effect because older kids have easier access to alcohol.
Logistic regression is a type of generalized linear model where the response variable can take only two values, for example true or false. In this dataset the response variable we are interested in is high alcohol use, which is either true or false.
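The two-valued response is modeled on the log-odds scale through the logit link; a minimal self-contained sketch of the link function and its inverse:

```r
# Logit link maps probabilities to log-odds; its inverse maps back
logit     <- function(p) log(p / (1 - p))
inv_logit <- function(x) 1 / (1 + exp(-x))

logit(0.5)    # 0: a 50% probability corresponds to log-odds of 0
inv_logit(0)  # 0.5
inv_logit(2)  # ~0.88: positive log-odds mean probability above 0.5
```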
Explore how sex is related to high alcohol use. Our hypothesis is that males drink more than females.
#Explore the distributions of females and males in the data
g0<-ggplot(data = alc, aes(x=sex))
g0 + geom_bar()
#Explore how high use is dependent on sex using barplots
g1<-ggplot(data = alc, aes(x = high_use, fill = sex))
g1 + geom_bar()+facet_wrap("sex")
There are approximately equal numbers of males and females in the data. There seem to be more males with high alcohol use compared to females, which fits the social-norms hypothesis.
Explore how absences are related to alcohol use using boxplots. Our hypothesis is that kids who are absent more are more likely to also drink more.
# Explore the distribution of absences
g6<-ggplot(alc, aes(x=absences, fill =sex))
g6+geom_bar() + ylab("absences") + ggtitle("Student absences by sex")
#Draw boxplots with high and low alcohol uses relation with absences
g3 <- ggplot(alc, aes(x=high_use, y=absences, col = sex))
g3 + geom_boxplot() + ylab("absences") + ggtitle("Student absences by alcohol consumption and sex")
Absences are not normally distributed: there is a long tail and the most common value is 0. There seem to be more absences among the kids with high alcohol use, which makes sense if kids who drink more also skip more school. The pattern looks similar for females and males.
Explore how going out is related to alcohol use. Our hypothesis is that kids who are going out more are exposed to alcohol more and thus will also drink more.
#Explore the distibution of kids going out
g4 <- ggplot(alc, aes(x=goout, fill=sex))
g4 + geom_bar()
#Explore the relationship with high alcohol use
g5 <- ggplot(alc, aes(x=high_use, y=goout, col = sex))
g5 + geom_boxplot() + ggtitle("Going out by alcohol consumption and sex")
Going out seems to be fairly normally distributed between the values 1 and 5. However, there aren’t very many kids who hardly go out at all compared to the ones who go out very often. There are approximately equal numbers of both sexes in each going-out class.
There definitely seems to be a relationship between going out and alcohol use. However, the strength of the link may differ by sex, since among females the ones who go out often but don’t have high alcohol use overlap with those who go out often and do.
Explore the relationship with age and alcohol use. Our hypothesis is that it is easier for older kids to get alcohol than for younger kids.
# Explore the distribution of age
g6<-ggplot(alc, aes(x=age, fill =sex))
g6+geom_bar()
# Explore the relationship of age with alcohol use
g7 <- ggplot(alc, aes(y=age, x=high_use, col = sex))
g7 + geom_boxplot() + ylab("age") + ggtitle("Student age by alcohol consumption and sex")
Age is not normally distributed, and there is quite a long tail towards higher values after 19. Ages range from 15 to 22, with 16 being the median. For ages 18 and up, there are approximately equal numbers of females and males.
There doesn’t seem to be a very obvious link between alcohol use and age in the boxplots, since the high-use boxplots overlap with the low-use boxplots. It might be that it is not very difficult to get alcohol in the students’ country even when you are younger than 18.
Next we will build a GLM with the binomial family using the explanatory variables of interest.
# Use generalized linear models to predict high use based on selected variables
m<-glm(high_use~age+sex+absences+goout, data=alc, family = "binomial")
#Print out summary and coefficients
summary(m)
##
## Call:
## glm(formula = high_use ~ age + sex + absences + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7877 -0.8026 -0.5393 0.8294 2.5201
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.60333 1.84395 -3.039 0.002376 **
## age 0.08951 0.11003 0.814 0.415905
## sexM 0.96338 0.25490 3.779 0.000157 ***
## absences 0.08205 0.02247 3.651 0.000261 ***
## goout 0.71753 0.12058 5.951 2.67e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 387.09 on 377 degrees of freedom
## AIC: 397.09
##
## Number of Fisher Scoring iterations: 4
#From the summary we can see that sex, absences and going out are significantly correlated with high alcohol use
coef(m)
## (Intercept) age sexM absences goout
## -5.60333157 0.08951186 0.96338497 0.08204599 0.71752626
#From the coefficients we can calculate the odds ratio by taking the exponent function of the estimate
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI<- confint(m) %>% exp
## Waiting for profiling to be done...
# Join the two together
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.003685565 9.268741e-05 0.1296785
## age 1.093640305 8.815211e-01 1.3582243
## sexM 2.620551969 1.599662e+00 4.3546202
## absences 1.085505735 1.040056e+00 1.1371493
## goout 2.049357367 1.626970e+00 2.6130756
From the odds ratios we can see that each additional year of age is associated with approximately 9% higher odds of high alcohol use, but the confidence interval includes 1, which means that age is not significant.
We can also see that being male multiplies the odds of high alcohol use by about 2.6; female is the reference level. The CI does not include 1, so sex is significant.
Each additional absence from school is associated with approximately 9% higher odds of high use, and this is also significant, since the CI does not include 1.
Going out, rated on a scale from 1 to 5, is significantly associated with high alcohol use, as the CI does not include 1: an increase of one unit multiplies the odds of high use by approximately 2.05.
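The odds ratios are simply the exponentiated model coefficients, so the multiplicative interpretation can be verified by hand using the goout estimate printed above:

```r
# Each one-unit increase in goout multiplies the odds by the same factor
or_goout <- exp(0.71752626)  # coefficient for goout from the summary
or_goout      # ~2.05, matching the OR table
or_goout^2    # ~4.2: going out two units more roughly quadruples the odds
```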
After fitting the model, we will use cross-validation to estimate the error rate of the model. We will use the model without age, since age was not significant.
# Fit the model without age
m<-glm(high_use~sex+absences+goout, data=alc, family = "binomial")
# Predict the probability of high alcohol use
probabilities <- predict(m, type = "response")
# Add the predicted probabilities to the alc table
alc <- mutate(alc, probability = probabilities)
# Classify as high alcohol use when the predicted probability exceeds 0.5
alc <- mutate(alc, prediction = probabilities>0.5)
# See the last ten original classes, predicted probabilities, and class predictions
select(alc, failures, absences, sex, high_use, probability, prediction) %>% tail(10)
## failures absences sex high_use probability prediction
## 373 1 0 M FALSE 0.14869987 FALSE
## 374 1 7 M TRUE 0.39514446 FALSE
## 375 0 1 F FALSE 0.13129452 FALSE
## 376 0 6 F FALSE 0.18714923 FALSE
## 377 1 2 F FALSE 0.07342805 FALSE
## 378 0 2 F FALSE 0.25434555 FALSE
## 379 2 2 F FALSE 0.07342805 FALSE
## 380 0 3 F FALSE 0.03989428 FALSE
## 381 0 4 M TRUE 0.68596604 TRUE
## 382 0 2 M TRUE 0.09060457 FALSE
# From here you can see that when the probability is >0.5 the prediction is "TRUE"
# Cross tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66230366 0.03926702 0.70157068
## TRUE 0.17015707 0.12827225 0.29842932
## Sum 0.83246073 0.16753927 1.00000000
# From here we can see that 65/318 predictions are wrong when low alcohol use is predicted and 15/64 are wrong when high alcohol use is predicted.
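The proportions can be converted back to raw counts to make the error counts in the comment explicit; a sketch assuming the alc table with predictions from above:

```r
# Raw counts and overall training accuracy from the confusion matrix
tab <- table(high_use = alc$high_use, prediction = alc$prediction)
tab
sum(diag(tab)) / sum(tab)  # ~0.79, i.e. ~21% training error
```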
# Define a loss function to get the mean prediction error
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# Call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2094241
# Do K-fold cross-validation using the loss_func defined previously
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# Average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2172775
The prediction error of the model with sex, absences and going out is approximately 22% in cross-validation.
This week we have learned about clustering, which is a handy tool for visualizing differences between data points. Clustering can be used for example for data exploration. We also learned about classification, which can be used to validate the results of clustering. Discriminant analysis was also touched upon.
We will load the dataset “Boston” from MASS package. The dataset includes data on crime rates per capita for towns in Boston with explanatory variables related to land use and inhabitant demographics.
# Access the MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(dplyr)
# Load the data
data("Boston")
# Explore the dataset's dimensions and the details of the variables
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# Load the corrplot package
library(corrplot)
## corrplot 0.84 loaded
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston)%>%round(digits=2)
# print the correlation matrix
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)
The strongest positive correlations with the crime rate are with rad (index of accessibility to radial highways), tax (full-value property-tax rate per $10,000), lstat (percentage of lower-status population) and indus (proportion of non-retail business acres per town). The strongest negative correlations are with medv (median value of owner-occupied homes in $1000s), black (1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town) and dis (weighted mean of distances to five Boston employment centres). Some of the variables have only a few high or low values, which can produce seemingly strong correlations that might not be significant.
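The visual impression can be confirmed by sorting the correlations of crim with the other variables; a sketch assuming cor_matrix from above:

```r
# Sort correlations with the crime rate to see the strongest ones
sort(cor_matrix["crim", ])
```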
Next we will scale the variables using the scale function.
# Center and standardize variables
boston_scaled <- scale(Boston)
# Summaries of the original, unscaled variables for comparison
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# Class of the boston_scaled object is matrix
class(boston_scaled)
## [1] "matrix"
# Change the matrix to a data frame
boston_scaled<-as.data.frame(boston_scaled)
# Summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# Create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# Create a categorical variable 'crime' using the quantiles as break points
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))
# Look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# Remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# Add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
Scaling centers each variable by subtracting the column mean and then divides the centered column by its standard deviation.
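As a sanity check, the same standardization can be done by hand for a single column; a self-contained sketch using the unscaled Boston data:

```r
library(MASS)
data("Boston")
# Manual centering and scaling should match scale()
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(manual, as.numeric(scale(Boston$crim)))  # TRUE
```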
Next we will create training and test sets from the data. We will use 80% of the data for training and then use the rest for testing our model.
# Number of rows in the Boston dataset
n <- nrow(boston_scaled)
# Choose 80% of the rows randomly
ind <- sample(n, size = n * 0.8)
# Create train set
train <- boston_scaled[ind,]
Next we will fit a linear discriminant analysis to the data using the lda() function with all the explanatory variables.
# Linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2623762 0.2376238 0.2351485 0.2648515
##
## Group means:
## zn indus chas nox rm
## low 1.00941458 -0.9100947 -0.12375925 -0.8913658 0.43497420
## med_low -0.05511905 -0.3371740 -0.06727176 -0.5489648 -0.15491043
## med_high -0.37711365 0.2142072 0.22498886 0.3759835 0.08611013
## high -0.48724019 1.0170108 -0.01476176 1.0518366 -0.35823181
## age dis rad tax ptratio
## low -0.8827193 0.8920845 -0.6936710 -0.7559019 -0.42480724
## med_low -0.2916009 0.3408668 -0.5368402 -0.4761418 -0.04537848
## med_high 0.4381160 -0.3853600 -0.4233536 -0.3168772 -0.33585720
## high 0.8056253 -0.8555369 1.6392096 1.5148289 0.78203563
## black lstat medv
## low 0.37609967 -0.76581382 0.49248739
## med_low 0.31485467 -0.13665696 -0.02735169
## med_high 0.07633184 -0.01845665 0.20153186
## high -0.76102195 0.88016998 -0.65879156
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.093098135 0.69138064 -0.89530025
## indus 0.038036163 -0.29931747 0.04548681
## chas 0.002670002 -0.04553786 0.06003566
## nox 0.435725523 -0.61138597 -1.35400776
## rm 0.049064447 0.01428973 -0.21689479
## age 0.226946689 -0.42304811 -0.09127248
## dis -0.036917397 -0.27537666 0.06538790
## rad 3.265787958 0.91138797 -0.26212564
## tax 0.137779609 -0.05444577 0.76157814
## ptratio 0.111327524 0.16380916 -0.19570211
## black -0.086453415 0.03587361 0.12182447
## lstat 0.252679256 -0.14959971 0.39868735
## medv 0.111960879 -0.40699415 -0.19492656
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9562 0.0341 0.0097
From the proportion of trace we see that LD1 explains about 96% of the between-group variance, so the first discriminant captures most of the separation between the crime classes. The summary also shows the means of the explanatory variables for the different crime classes: for example, rad and tax have quite large differences in their means between the low and high crime-rate classes.
Next we will draw a plot and see how the different towns group in clustering. We will color and annotate the dots with crime rates and draw arrows for directions and weights of the different explanatory variables.
# The function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# Target classes as numeric
classes <- as.numeric(train$crime)
# Plot the lda results
plot(lda.fit, dimen = 2,col = classes, pch=classes)
lda.arrows(lda.fit, myscale = 2)
From the plot it seems that rad (accessibility to radial highways) has the biggest impact on the grouping. There are two clusters in the plot, one of which contains almost all of the highest crime-rate towns and some medium-high ones, but none of the medium-low or low crime-rate towns. It seems the model might only need rad to predict the crime rates.
Next we will validate the results using the test set we made earlier. We will predict crime rate classes with test data using the lda.fit model we created with the training set.
# Create test set
test <- boston_scaled[-ind,]
# Save the correct classes from test data
correct_classes <- test$crime
# Remove the crime variable from test data
test <- dplyr::select(test, -crime)
# Predict classes with test data using the lda.fit model created from the training set
lda.pred <- predict(lda.fit, newdata = test)
# Cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 12 8 1 0
## med_low 3 23 4 0
## med_high 0 12 17 2
## high 0 0 0 20
From the cross-tabulation, we can see that 20/20 high crime-rate towns were predicted correctly as high, although two med_high cases were incorrectly classified as high. This is pretty good! For the low class, 12/21 were predicted correctly, while 8 were predicted as med_low and one as med_high.
The model struggles most with the middle classes. Of the med_high cases, 17/31 were correctly classified, while 12 were predicted as med_low and two as high.
The model does a bit better with the med_low class: 23/30 were correctly classified, with three predicted as low and four as med_high.
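The per-class counts can be summarized into an overall accuracy (a sketch; the exact value depends on the random train/test split):

```r
# Overall proportion of correctly classified test observations
tab <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(tab)) / sum(tab)  # ~0.71 for the table shown above
```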
Next we will reload the Boston dataset, scale it and use k-means algorithm to predict the optimal number of clusters in the dataset.
# Obtain the dataset again
data('Boston')
# Scale the variables and change the matrix to a data frame
boston_scaled2 <- as.data.frame(scale(Boston))
# K-means clustering with three centers
km <- kmeans(boston_scaled2, centers = 3)
# Plot all 14 variables of the Boston dataset, colored by cluster
pairs(boston_scaled2, col = km$cluster)
The variables rad and tax seem to split the data quite nicely into two clusters in the crim vs rad and crim vs tax panels. Three clusters might not be needed; two might suffice. Let's check this next.
One way to determine the number of clusters is to look at how the total within-cluster sum of squares (WCSS) behaves when the number of clusters changes. A good number of clusters is the point where the total WCSS drops dramatically, which can be observed in the plot.
# MASS, ggplot2 and Boston dataset are available
set.seed(123)
# Determine the maximum number of clusters
k_max <- 10
# Calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
# Visualize the results with qplot
qplot(x = 1:k_max, y = twcss, geom = 'line')
# K-means clustering with the suggested two centers
km <- kmeans(boston_scaled2, centers = 2)
# Plot the Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)
As we observed in the previous pairs plot, based on the qplot and the WCSS it seems that the optimal number of clusters is 2. Rad and tax produce the best separation between the classes.
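As a cross-check of the WCSS elbow, one could also compare average silhouette widths for a few values of k. This is a sketch using the `cluster` package (assumed installed), not part of the original analysis:

```r
library(cluster)

# Distance matrix needed for the silhouette computation
d <- dist(boston_scaled2)

# Average silhouette width for k = 2..5; a higher value indicates
# better-separated clusters
sapply(2:5, function(k) {
  km_k <- kmeans(boston_scaled2, centers = k)
  mean(silhouette(km_k$cluster, d)[, "sil_width"])
})
```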
This week we have learned about dimensionality reduction techniques, which are handy tools for visualizing differences between data points in very few dimensions when the actual data is complex. These methods include PCA (principal component analysis) and CA (correspondence analysis). Usually results are visualized in two dimensions for the sake of understandability.
First we will load the human dataset, which contains country-level variables related to economy, education and health.
# Read in table
human<-read.table("data/human.txt", header=TRUE, row.names = 1, sep = "\t")
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
# Access GGally
library(GGally)
# Visualize the variables
ggpairs(human)
# Create summaries of the variables
summary(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
Many of the variables have long tails, so they might not be normally distributed. Education expectancy seems to be roughly normally distributed.
Mean life expectancy is 71.7 years and mean education expectancy 13.2 years. There are big differences in the GNI values of different countries: the minimum is 581, the median 12040 and the maximum 123124.
Maternal mortality and adolescent birth rate are strongly correlated with each other, and so are life expectancy and education expectancy.
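These correlation claims can be checked numerically from the same data. A quick sketch, assuming the `human` data frame loaded above:

```r
# Pairwise Pearson correlations, rounded for readability;
# look for e.g. the Mat.Mor vs Ado.Birth and Life.Exp vs Edu.Exp entries
round(cor(human), 2)
```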
Next we will plot the variables in a two-dimensional space using principal component analysis, which compresses the variation into fewer dimensions. The dimensions are ranked from one upwards: the first dimension, or principal component, captures the largest amount of variation in the data and the second the second-largest amount.
# Perform principal component analysis on non-standardized data
pca_human <- prcomp(human)
# Create and print out a summary of pca_human
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
# Rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1)
# Print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
# Create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# Draw a biplot
biplot(pca_human, cex = c(0.8, 0.1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
Next we will repeat the analysis for standardized data.
# Standardize the variables
human_std <- scale(human)
# Print out summaries of the standardized variables
summary(human_std)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7188 Min. :-2.7378
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6425 1st Qu.:-0.6782
## Median : 0.3503 Median : 0.2316 Median : 0.3056 Median : 0.1140
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.6717 3rd Qu.: 0.7126
## Max. : 2.6646 Max. : 1.6632 Max. : 1.4218 Max. : 2.4730
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
# Perform principal component analysis
pca_human <- prcomp(human_std)
# Create and print out a summary of pca_human
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
# Rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1)
# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
# Create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# Draw a biplot
biplot(pca_human, cex = c(0.8, 0.8), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
In the unscaled data, the variable GNI (gross national income) seems to explain practically all of the variance (PC1 is at 100%) while PC2 explains essentially none (0%). The countries with the highest GNI are on the left and the ones with the lowest GNI are on the right.
In the scaled data, PC1 explains 53.6% of the variance. The variables that drive PC1 are education expectancy, gross national income, education rate and life expectancy, which are all positively correlated with each other, together with maternal mortality rate and adolescent birth rate, which are positively correlated with each other but negatively correlated with the aforementioned variables.
PC2 explains 16.2% of the variance; the labour rate and the percentage of females in the parliament drive this component.
Scaling made a big difference in the results: GNI was driving most of the differences in the unscaled data because there were such huge differences in the GNIs of different countries. After scaling we could see the effect of the other variables as well.
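The point about scaling can also be seen directly from the variances of the raw variables. A sketch, assuming the `human` data frame from above:

```r
# Variances on the original scale: GNI's variance dwarfs all the others,
# which is why unscaled PCA is essentially a ranking by GNI
round(sapply(human, var), 1)
```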
PC1 separates countries based on GNI, education, life expectancy, maternal mortality and adolescent birth rates. Most of the extreme values with low GNI, life expectancy etc. are African countries.
Countries at the opposite end of the PC1 axis include many European countries, the USA, Japan, South Korea and Arab states.
From the scaled figure you can see quite nicely that, for example, the Nordic countries group together. They have quite high life expectancy, GNI, education expectancy and education rate, and also a high labour rate and percentage of females in the parliament. There are also countries where GNI and the related variables are quite high but the labour rate and the percentage of females in the parliament are low, such as Qatar, Kuwait and Saudi Arabia. These oil-producing Arab countries therefore separate along the PC2 axis.
Many African countries have high labour rates and percentages of females in the parliament, but do not have that high education, life expectancy or GNI. The second component divides the African countries into those with low and those with high labour rates and percentages of females in the parliament.
First we will load the tea dataset and explore how the data looks.
# Load library and tea dataset
library(FactoMineR)
data(tea)
# Explore tea dataset
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming
## feminine :129 Not.sophisticated: 85 No.slimming:255
## Not.feminine:171 sophisticated :215 slimming : 45
##
##
##
##
##
## exciting relaxing effect.on.health
## exciting :116 No.relaxing:113 effect on health : 66
## No.exciting:184 relaxing :187 No.effect on health:234
##
##
##
##
##
There are 300 observations and 36 variables. Most variables are factors, except age which is an integer.
### MCA analysis
Then we will do the multiple correspondence analysis with a subset of the variables.
# Access dplyr
library(dplyr)
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep_columns))
# look at the summaries and structure of the data
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage="quali")
From the plot you can see that people who drink unpackaged tea usually shop in tea shops, people who buy their tea from chain stores prefer tea bags, and people who drink both kinds also shop in both types of shops.
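As a possible follow-up (a sketch, not part of the original analysis), the individuals can be plotted in the same MCA space and the eigenvalues inspected directly from the model object:

```r
# Plot the individuals instead of the variable categories
plot(mca, invisible = c("var"), habillage = "quali")
# Eigenvalues and percentages of variance per dimension
head(mca$eig)
```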